Information Retrieval and Large Text Structured Corpora
Abstract
Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain text documents or documents with a very elementary structure. These systems can resolve queries very efficiently, but they cannot take into account the tags that mark different sections, or at best this capability is very limited. In contrast, the documents that make up a corpus nowadays often have a rich structure. They are structured using XML (Extensible Markup Language) [1] or in some other format that can be converted to XML more or less easily. Classical IRSs built to work with these kinds of corpora therefore cannot exploit this structure, and their results do not improve. In addition, several of these corpora are very large: they include hundreds or thousands of documents, which in turn contain millions or hundreds of millions of words. There is therefore a need to build efficient and flexible IRSs that work with large structured corpora. There are several examples of IRSs based on corpora [2] [3] and of search methods over large corpora [4], and Chaudhri et al. [5] even present a review of different technologies that can be used to build generic IRSs based on XML. However, there are no comparative analyses or studies of the technologies that can be used to build IRSs based on large structured corpora. Since these IRSs can be wide ranging, in this work we focus on those that work with corpora structured in XML format that do not include any morphosyntactic annotation. All topics studied in this paper will also be useful for annotated corpora (although the study needs to be completed for the latter) and for corpora not in XML format (because correctly structured corpora can be easily converted to XML).
Similar resources
The Need of Structured Data: Introducing the OKgraph Project
Although many computational problems can be approached using Deep Learning, in this position paper we argue that in the case of Information Retrieval tasks this is not mandatory and is even detrimental whenever alternatives exist. Instead of learning (by training) how to solve the full problem, we suggest splitting it into two sub-problems: a) producing structured data (specifically knowledge graph...
Speech Recognition and Information Retrieval: Experiments in Retrieving Spoken Documents
The Informedia Digital Video Library Project at Carnegie Mellon University is making large corpora of video and audio data available for full content retrieval by integrating natural language understanding, image processing, speech recognition and information retrieval. Information retrieval from corpora of speech recognition output is critical to the project’s success. In this paper, we out...
Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval
The expansion of the Web creates more requirements for Cross-Language Information Retrieval (CLIR). Query translation is the key problem. Previous studies have shown that query translation can be done by exploiting a large set of parallel texts. However, the problem that arises is the unavailability of large parallel corpora for many languages. In this paper, we describe a mining system that automat...
Question Answering via Integer Programming over Semi-Structured Knowledge
Answering science questions posed in natural language is an important AI challenge. Answering such questions often requires non-trivial inference and knowledge that goes beyond factoid retrieval. Yet, most systems for this task are based on relatively shallow Information Retrieval (IR) and statistical correlation techniques operating on large unstructured corpora. We propose a structured infere...
WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages
Annotated corpora are sets of structured text used to enable Natural Language Processing (NLP) tasks. Annotations may include tagged parts-of-speech, semantic concepts assigned to phrases, or semantic relationships between these concepts in text. Building annotated corpora is labor-intensive and presents a major obstacle to advancing machine translators, named entity recognizers (NER), part-ofs...